Data Description
This is a subset of the Human Cell Atlas Bone Marrow Big Dataset, comprising 380,000 cells from 8 donors.
Analysis Workflow
The basic analysis workflow will follow these main steps:
flowchart TD
A((scRNA-seq Data)) --> B[Exploratory Data Analysis]
A --> C[Quality Control]
C --> D[Normalization]
B --> E[Feature Selection]
D --> E
E --> F[Dimensionality Reduction]
F --> G[Clustering]
G --> H[Automated Celltype Annotation]
%% Enhanced styling with lighter colors and distinct shapes
style A fill:#f0f0f0,stroke:#888,stroke-width:1px,stroke-dasharray: 5 5;
style B fill:#cfe,stroke:#888,stroke-width:1px;
style C fill:#cfe,stroke:#888,stroke-width:1px;
style D fill:#cfe,stroke:#888,stroke-width:1px;
style E fill:#e7f7ff,stroke:#888,stroke-width:1px;
style F fill:#e7f7ff,stroke:#888,stroke-width:1px;
style G fill:#e7f7ff,stroke:#888,stroke-width:1px;
style H fill:#f0f0f0,stroke:#888,stroke-width:1px,stroke-dasharray: 5 5;
General Stats
nCell: Number of cells in each sample
mean_UMI: Mean total UMI count per cell in each sample
mean_Gene: Mean number of genes detected per cell in each sample
mean_Mito: Mean UMI count from mitochondrial genes per cell in each sample
mean_Mito_percent: Mean percentage of mitochondrial UMIs per cell in each sample
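As a rough illustration of how these per-sample statistics could be derived from per-cell values, here is a minimal Python sketch. The record layout (sample, total UMI, genes detected, mitochondrial UMI) and the toy numbers are assumptions made for illustration and are not taken from the dataset. The sketch also assumes that mean_Mito_percent is the mean of per-cell percentages rather than a ratio of sample totals.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-cell records: (sample, total UMI, genes detected, mitochondrial UMI).
cells = [
    ("donor1", 1000, 400, 50),
    ("donor1", 3000, 800, 90),
    ("donor2", 2000, 600, 200),
]

def summarize(cells):
    """Group cells by sample and compute the summary columns listed above."""
    by_sample = defaultdict(list)
    for sample, umi, genes, mito in cells:
        by_sample[sample].append((umi, genes, mito))
    stats = {}
    for sample, rows in by_sample.items():
        stats[sample] = {
            "nCell": len(rows),
            "mean_UMI": mean(r[0] for r in rows),
            "mean_Gene": mean(r[1] for r in rows),
            "mean_Mito": mean(r[2] for r in rows),
            # Assumption: the percentage is computed per cell, then averaged.
            "mean_Mito_percent": mean(100 * r[2] / r[0] for r in rows),
        }
    return stats

stats = summarize(cells)
```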
Quality Control
Per Cell Quality Checking
This step uses the automated quality control function quickPerCellQC. It calculates median absolute deviation (MAD) thresholds for the metrics below, identifies cells that are outliers on any of them, and flags those cells for removal.
Total Counts (Library size): Cells with a library size below a certain threshold (e.g., 3 MADs below the median) are flagged.
Detected Features (Genes): Cells with a low number of detected features (e.g., 3 MADs below the median) are flagged.
Mito Percent (Mitochondrial Percentage): Cells with a high percentage of mitochondrial reads (e.g., 3 MADs above the median) are flagged.
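The MAD rule described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the quickPerCellQC implementation: the function name `mad_outliers`, the toy values, the 1.4826 scaling constant, and the choice to test library size on a log scale are all assumptions of this sketch.

```python
import math
from statistics import median

def mad_outliers(values, nmads=3.0, side="lower", log=False):
    """Flag values more than `nmads` MADs from the median on one side."""
    x = [math.log(v) for v in values] if log else list(values)
    med = median(x)
    # Assumption: scale the MAD by 1.4826 so it is comparable to a
    # standard deviation for normally distributed data.
    mad = 1.4826 * median(abs(v - med) for v in x)
    if side == "lower":
        return [v < med - nmads * mad for v in x]
    return [v > med + nmads * mad for v in x]

# Hypothetical per-cell metrics; the last cell is deliberately poor.
lib_size = [5000, 5200, 4800, 5100, 4900, 300]
mito_pct = [2.0, 3.0, 2.5, 3.5, 2.8, 40.0]

low_lib = mad_outliers(lib_size, side="lower", log=True)
high_mito = mad_outliers(mito_pct, side="higher")

# A cell is marked for discarding if it fails any check.
discard = [a or b for a, b in zip(low_lib, high_mito)]
```

Working on a log scale for library size keeps the lower threshold from dropping below zero when the distribution has a long right tail.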
Low Quality Cells Filtering
The low-quality cells flagged above are discarded.
Normalization
This step uses the function logNormCounts to normalize the single-cell RNA-seq data by scaling each cell's counts according to its total count (library size) and then applying a log-transformation. This adjusts for differences in sequencing depth across cells and prepares the data for downstream analysis.
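To make the arithmetic concrete, here is a minimal Python sketch of library-size normalization followed by a log-transformation. It assumes size factors equal to library sizes scaled to a mean of 1 and a log2 transform with a pseudocount of 1; these choices, the function name `log_norm_counts`, and the toy matrix are assumptions of the sketch, not a statement of what the report's code does.

```python
import math

def log_norm_counts(counts):
    """counts: list of cells, each a list of per-gene counts."""
    lib_sizes = [sum(cell) for cell in counts]
    mean_lib = sum(lib_sizes) / len(lib_sizes)
    # Size factors: library sizes scaled to have a mean of 1.
    size_factors = [s / mean_lib for s in lib_sizes]
    # log2 with a pseudocount of 1 keeps zero counts at zero.
    return [
        [math.log2(c / sf + 1) for c in cell]
        for cell, sf in zip(counts, size_factors)
    ]

# Two hypothetical cells with the same composition; the second is
# sequenced twice as deeply.
counts = [[10, 0, 30], [20, 0, 60]]
normed = log_norm_counts(counts)
```

After normalization the two cells have matching profiles, which is the intended effect of removing the depth difference.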
Feature Selection
This step performs feature selection by identifying highly variable genes (HVGs). First, modelGeneVarByPoisson models the variance of each gene's expression across cells, accounting for donor-specific effects to isolate biologically relevant variability. Then, getTopHVGs selects the 5,000 most variable genes, which are the most likely to carry biological signal and are used in downstream analyses (dimensionality reduction and clustering).
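The idea of ranking genes by variance beyond technical noise, while blocking on donor, can be sketched as follows. This is a deliberately simplified stand-in and not the modelGeneVarByPoisson algorithm: it works on raw counts, assumes Poisson noise so that the expected technical variance equals the mean, and averages the within-donor estimates. The gene names, donor labels, and function names are hypothetical.

```python
from statistics import mean, variance

def excess_variance(counts):
    """Variance above the Poisson expectation (variance == mean) for one gene."""
    return variance(counts) - mean(counts)

def top_hvgs(expr, donors, n):
    """expr: {gene: [count per cell]}; donors: donor label per cell."""
    blocks = sorted(set(donors))
    scores = {}
    for gene, counts in expr.items():
        per_block = []
        for b in blocks:
            vals = [c for c, d in zip(counts, donors) if d == b]
            per_block.append(excess_variance(vals))
        # Averaging within-donor estimates keeps donor-to-donor shifts
        # from being counted as biological variability.
        scores[gene] = mean(per_block)
    return sorted(scores, key=scores.get, reverse=True)[:n]

donors = ["d1", "d1", "d1", "d2", "d2", "d2"]
expr = {
    "flat": [5, 5, 5, 5, 5, 5],
    "donor_shift": [1, 1, 1, 9, 9, 9],   # differs only between donors
    "variable": [0, 10, 2, 1, 12, 0],    # varies within each donor
}
hvgs = top_hvgs(expr, donors, n=1)
```

Without blocking, the `donor_shift` gene would look highly variable even though all of its variation comes from the donor difference.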
Top 5,000 HVGs
Clustering
- This step first runs UMAP on the MNN-based reduced dimensions, using Annoy for efficient approximate nearest-neighbor search, which suits large datasets.
- Next, it performs two-step clustering: K-means first creates 1,000 initial clusters, which are then refined with a nearest-neighbor graph using k=5.
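The two-step idea can be sketched in plain Python on toy data. This is a simplified illustration under stated assumptions, not the report's implementation: the K-means uses deterministic starting centers, the second stage merges K-means centers by taking connected components of their nearest-neighbor graph (a stand-in for graph-based community detection), and the values of 4 initial clusters and k=1 are scaled down from the 1,000 and k=5 used in the analysis. All function names are hypothetical.

```python
import math

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

def kmeans(points, k, iters=10):
    """Lloyd's algorithm with evenly spaced points as starting centers."""
    step = len(points) / k
    centers = [points[int(i * step)] for i in range(k)]
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return [nearest(p, centers) for p in points], centers

def knn_components(centers, k):
    """Link each center to its k nearest centers; return a component id per center."""
    n = len(centers)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(centers[i], centers[j]))
        for j in others[:k]:
            parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    ids = {r: m for m, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

def two_step(points, n_initial, k):
    """Step 1: many small K-means clusters. Step 2: merge them via a kNN graph."""
    assign, centers = kmeans(points, n_initial)
    merged = knn_components(centers, k)
    return [merged[a] for a in assign], assign

# Two well-separated groups, each made of two tight sub-groups.
points = [(0, 0), (0, 1), (3, 0), (3, 1),
          (100, 0), (100, 1), (103, 0), (103, 1)]
labels, initial = two_step(points, n_initial=4, k=1)
```

Clustering the centers instead of the individual cells is what makes the second, more expensive graph step tractable on a dataset of this size.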
Cluster Identification
Here is the clustering result:
Similarity in sample clusters
The distribution of cells across clusters and donors is shown below. It provides a visual summary of how different donors contribute to each cluster.
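The table behind such a summary is a cross-tabulation of cluster labels against donor labels. A minimal Python sketch, using made-up labels and a hypothetical helper `donor_fractions`, could look like this:

```python
from collections import Counter

# Hypothetical per-cell cluster and donor labels.
clusters = [1, 1, 2, 2, 2, 3]
donors = ["d1", "d2", "d1", "d1", "d2", "d2"]

# Number of cells for every (cluster, donor) pair.
counts = Counter(zip(clusters, donors))

def donor_fractions(cluster):
    """Fraction of a cluster's cells contributed by each donor."""
    total = sum(n for (c, _), n in counts.items() if c == cluster)
    return {d: n / total for (c, d), n in counts.items() if c == cluster}
```

A cluster drawn almost entirely from one donor, such as cluster 3 in this toy example, is a hint that donor effects may not have been fully removed.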
Cluster visualization with UMAP
The clusters are visualized on a UMAP plot.
Cell Type Annotation
Assigned Cell Type
This step performs automated cell type classification using a reference dataset to annotate each cluster based on its pseudo-bulk profile.
Reference dataset: HumanPrimaryCellAtlasData. This reference dataset provides normalized expression values for 713 microarray samples from the Human Primary Cell Atlas (HPCA) (Mabbott et al., 2013). These 713 samples were processed and normalized as described in Aran, Looney and Liu et al. (2019).
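Annotating a cluster from its pseudo-bulk profile can be sketched as follows. The sketch assumes that a cluster is scored against each reference profile by Spearman rank correlation and given the best-scoring label; the scoring rule, the function names, the labels, and all numbers are assumptions for illustration and do not reproduce the report's annotation code or the HPCA values.

```python
from statistics import mean

def ranks(values):
    """Average 1-based ranks, so tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def pseudo_bulk(cells):
    """Sum counts over the cells of one cluster, gene by gene."""
    return [sum(g) for g in zip(*cells)]

def annotate(profile, reference):
    """Return the reference label whose profile correlates best."""
    return max(reference, key=lambda label: spearman(profile, reference[label]))

# Hypothetical reference profiles over four genes.
reference = {
    "T_cells": [9.0, 1.0, 2.0, 0.5],
    "B_cell": [1.0, 8.0, 0.5, 3.0],
}
# Hypothetical counts for the two cells of one cluster.
cluster_cells = [[5, 0, 1, 0], [7, 1, 2, 0]]
profile = pseudo_bulk(cluster_cells)
label = annotate(profile, reference)
```

Rank correlation is a natural choice here because the reference is microarray data and the query is UMI counts, so only the ordering of genes is directly comparable.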
UMAP visualization of cell types
The assigned cell types are visualized on a UMAP plot.